Using Large Page and Processor Binding to Optimize the Performance of OpenMP Scientific Applications on an IBM POWER5+ System

Authors

  • Xingfu Wu
  • Valerie Taylor
Abstract

Multicore processors are widely used for high-performance computing and are being configured hierarchically to compose multicore systems. While this presents significant new opportunities, such as high inter-core bandwidth and low inter-core latency, it also presents new challenges in the form of inter-core resource conflict and contention. One challenge is how well current shared-memory parallel programming paradigms, such as OpenMP, exploit the potential of such a multicore system for scientific applications. In this paper, we analyze the performance of OpenMP scientific applications on an IBM POWER5+ system: the NAS Parallel Benchmarks, an OpenMP matrix multiplication, and a large-scale scientific application, the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell code for magnetic fusion. We use large pages and processor binding to significantly improve OpenMP performance without requiring any program modifications. Our experimental results show that using 64 KB large pages on the multicore system yields up to 33.92% performance improvement for the OpenMP NAS Parallel Benchmarks, and that the OpenMP benchmarks benefit more from large pages than their MPI counterparts; using 64 KB large pages together with processor binding yields up to 67.18% performance improvement for matrix multiplication and up to 10.13% for GTC. Our results also indicate that OpenMP performance can be improved by applying conventional loop optimization techniques, such as blocking and unrolling, inside the OpenMP parallel regions.
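The abstract's closing point about blocking and unrolling inside OpenMP parallel regions can be illustrated with a small sketch. The paper's actual kernels are not shown on this page, so everything below is an assumption made for illustration: the function names `matmul_naive` and `matmul_blocked`, the row-major layout, the caller-chosen block size `bs`, and the unroll factor of 4. Large pages and processor binding are applied outside the program (the abstract notes that no program modifications are needed), so they do not appear in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: plain triple loop, C = A * B, for n x n
 * row-major matrices. Rows of C are divided among OpenMP threads. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    long nn = (long)n;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nn; i++) {
        for (long j = 0; j < nn; j++) {
            double sum = 0.0;
            for (long k = 0; k < nn; k++)
                sum += a[i * nn + k] * b[k * nn + j];
            c[i * nn + j] = sum;
        }
    }
}

/* Hypothetical optimized variant: the loops are tiled into bs x bs blocks
 * (blocking) and the innermost j loop is unrolled by 4 (unrolling).
 * Each thread owns a band of rows of C, so no two threads write the
 * same element. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(n > 0 && bs > 0);
    long nn = (long)n, blk = (long)bs;

    /* C is accumulated in place, so clear it first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nn * nn; i++)
        c[i] = 0.0;

    #pragma omp parallel for schedule(static)
    for (long ii = 0; ii < nn; ii += blk) {
        long i_end = (ii + blk < nn) ? ii + blk : nn;
        for (long kk = 0; kk < nn; kk += blk) {
            long k_end = (kk + blk < nn) ? kk + blk : nn;
            for (long jj = 0; jj < nn; jj += blk) {
                long j_end = (jj + blk < nn) ? jj + blk : nn;
                for (long i = ii; i < i_end; i++) {
                    for (long k = kk; k < k_end; k++) {
                        double aik = a[i * nn + k];
                        long j = jj;
                        /* Main loop, unrolled by 4. */
                        for (; j + 3 < j_end; j += 4) {
                            c[i * nn + j]     += aik * b[k * nn + j];
                            c[i * nn + j + 1] += aik * b[k * nn + j + 1];
                            c[i * nn + j + 2] += aik * b[k * nn + j + 2];
                            c[i * nn + j + 3] += aik * b[k * nn + j + 3];
                        }
                        /* Remainder when the tile width is not a multiple of 4. */
                        for (; j < j_end; j++)
                            c[i * nn + j] += aik * b[k * nn + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; the pragmas are ignored when the code is compiled without OpenMP support, so the sketch also runs serially. A suitable block size depends on the cache hierarchy of the target machine and would need to be tuned experimentally.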

Related resources

Optimizing DDA Code on a POWER5 Processor

In this paper we take an existing scientific computation code, DDA, and optimize it to run on an IBM Power5 processor. The DDA code, originally developed by a Ph.D. candidate in physics, suffers from excessive execution time caused by a high number of cache accesses and a low rate of instructions per cycle. Our goal is to improve the code’s performance by making a series of optimizations in a s...

Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers

In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, MPI and hybrid applications with weak scaling on three large-scale multicore supercomputers: IBM POWER4, POWER5+ and BlueGene/P, and analyze the performance of these MPI, OpenMP and hybrid applications. We use STREAM m...

Parallel Simulations of Dynamic Earthquake Rupture Along Geometrically Complex Faults on CMP Systems

Chip multiprocessors (CMP) are widely used for high performance computing and are being configured in a hierarchical manner to compose a CMP compute node in a CMP system. Such a CMP system provides a natural programming paradigm for hybrid MPI/OpenMP applications. In this paper, we use OpenMP to parallelize a sequential earthquake simulation code for modeling spontaneous earthquake rupture alon...

Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems

Despite their dominance of high-end computing (HEC) through the 1980s, vector systems have been gradually replaced by microprocessor-based systems. However, while peak performance of microprocessor-based systems has grown exponentially, the gradual slide in sustained performance delivered to scientific applications has become a growing concern among HEC users. Recently, the Earth Simulator and ...

OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an M...

Publication date: 2009